{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Value function\n",
    "\n",
    "The value function also called the state value function denotes the value of the state. The value of a state is the return an agent would obtain starting from that state following the policy $\\pi$. The value of a state or value function is usually denoted by $V(s)$ and it can be expressed as: \n",
    "\n",
    "$$ V^{\\pi}(s) = [R(\\tau) | s_0 = s] $$\n",
    "\n",
    "where, $ s_0 = s $ implies that the starting state is $s_0$. The value of a state is called the state value. \n",
    "\n",
    "Let's understand the value function with an example. Let's suppose we generate the trajectory $\\tau$ following some policy $\\pi$ in our grid world environment as shown below:\n",
    "\n",
    "\n",
    "![title](Images/28.png)\n",
    "\n",
    "Now, how do we compute the value of all the states in our trajectory? We learned that the value of a state is the return(sum of reward) an agent would obtain starting from that state following the policy $\\pi$. The above trajectory is generated using the policy $\\pi$, thus we can say that the value of a state is the return(sum of rewards) of the trajectory starting from that state. \n",
    "\n",
    "* The value of the state A is the return of the trajectory starting from the state A. Thus, $V(A) = 1+1+-1+1 = 2 $\n",
    "* The value of the state D is the return of the trajectory starting from the state D. Thus, $V(D) = 1-1+1 = 1 $\n",
    "* The value of the state E is the return of the trajectory starting from the state E. Thus, $V(E) = -1+1 = 0 $\n",
    "* The value of the state H is the return of the trajectory starting from the state H. Thus, $V(H) = 1$\n",
    "\n",
    "\n",
    "What about the value of the final state I?  We learned the value of a state is return (sum of rewards) starting from that state. We know that we obtain a reward when we transition from one state to another. Since I is the final state, we don't make any transition from final state and so there will no reward and thus no value for the final state I. \n",
    "\n",
    "_In a nutshell, the value of a state is the return of the trajectory starting from that state._\n",
    "\n",
    "\n",
    "Wait! there is a small change here, instead of taking the return directly as a value of a state we will use the expected return. Thus, the value function or the value of the state $s$ can be defined as the expected return that the agent would obtain starting from the state $s$ following the policy $\\pi$ . It can be expressed as:\n",
    "\n",
    "\n",
    "$$V^{\\pi}(s) = \\underset{\\tau \\sim \\pi}{\\mathbb{E}}[R(\\tau) | s_0 = s] $$\n",
    "\n",
    "\n",
    "\n",
    "Let's understand this with a simple example. Let's suppose we have a stochastic policy $\\pi$.  We learned that, unlike deterministic policy which maps the state to action directly, stochastic policy maps the state to the probability distribution over action space. Thus, the stochastic policy selects actions based on a probability distribution.\n",
    "\n",
    "Let's suppose we are in state A and the stochastic policy returns the probability distribution over the action space as [0.0,0.80,0.00,0.20]. It implies that with the stochastic policy, in the state A, we perform action down for 80% of the time, that is, $\\pi(\\text{down} |A) = 0.80 $ and the action right for the 20 % of the time, that is $\\pi(\\text{right} |A) = 0.20 $ .\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Thus, in state A, our stochastic policy  $\\pi$ selects action down for 80% of time and action right for 20% of the time and say our stochastic policy selects action right in the state D and E and action down in the state B and F for the 100% of the time.\n",
    "\n",
    "First, we generate an episode $\\tau_1$ using our given stochastic policy $\\pi$ as shown below:\n",
    "\n",
    "\n",
    "![title](Images/29.PNG)\n",
    "\n",
    "For better understanding, let's focus only on the value of state A. The value of the state A is the return (sum of rewards) of the trajectory starting from the state A. Thus, $V(A) = R(\\tau_1) = 1+1+1+1 =4 $\n",
    "\n",
    "Say, we generate another episode $\\tau_2$ using the same given stochastic policy $\\pi$ as shown below:\n",
    "\n",
    "\n",
    "![title](Images/30.PNG)\n",
    "\n",
    "The value of the state, A is the return (sum of rewards) of the trajectory from the state A. Thus, $V(A) = R(\\tau_2) = -1+1+1+1 =2 $\n",
    "\n",
    "As you may observe, although we use the same policy, the value of state A in the trajectory $\\tau_1$ and $\\tau_2$ are different. This is because our policy is a stochastic policy and it performs an action down in state A for 80% of time and action right in state A for 20% of the time. So, when we generate a trajectory using the policy $\\pi$, the trajectory  $\\tau_1$ will occur for 80% of the time and the trajectory $\\tau_2$ will occur for 20% of the time. Thus, the return will be 4 for 80% of the time and 2 for 20% of the time. \n",
    "\n",
    "Thus, instead of taking the value of the state as a return directly we will take the expected return since the return takes different values with some probability. The expected return is basically the weighted average, that is, the sum of the return multiplied by their probability. Thus, we can write:\n",
    "\n",
    "\n",
    "$$ V^{\\pi}(s) = \\underset{\\tau \\sim \\pi}{\\mathbb{E}}[R(\\tau) | s_0 = s] $$\n",
    "\n",
    "\n",
    "\n",
    "The value of a state A can be obtained as:\n",
    "$$\\begin{align}V^{\\pi}(A) &= \\underset{\\tau \\sim \\pi}{\\mathbb{E}}[R(\\tau) | s_0 = A] \\\\ &= \\sum_i R(\\tau_i) \\pi(a_i|A) \\\\ &= R(\\tau_1) \\pi (\\text{down}|A) + R(\\tau_2) \\pi (\\text{right}|A)\\\\ &= 4(0.8) + 2(0.2) \\\\ &= 3.6 \n",
    "\\end{align} $$\n",
    "\n",
    "\n",
    "Thus, the value of a state is the expected return of the trajectory starting from that state.\n",
    "\n",
    "Note that the value function depends on the policy, that is, the value of the state varies based on the policy we choose. There can be many different value functions according to different policies. The optimal value function,  $ V^*(s) $ is the one which yields maximum value compared to all the other value functions. It can be expressed as:\n",
    "\n",
    "$$V^{*}(s) = \\max_{\\pi} V^{\\pi}(s) $$\n",
    "\n",
    "\n",
    "\n",
    "For example, let's say we have two policies $\\pi_1$ and $\\pi_2$. Let the value of a state  using the policy $\\pi_1$  be $V^{\\pi_1} (s) = 13 $ and the value of the state  using the policy $\\pi_2$ be $V^{\\pi_2} (s) = 11 $.  Then the optimal value of the state  will be  $V^*(s) = 13 $ as it is the maximum. The policy which gives the maximum state value is called optimal policy $\\pi^*$. Thus, in this case, $\\pi_1$ is the optimal policy as it gives the maximum state value. \n",
    "\n",
    "We can view the value function in a table and it is called a value table. Let us say we have two states $s_0$ and  $s_1$ then the value function can be represented as:\n",
    "\n",
    "\n",
    "![title](Images/31.PNG)\n",
    "\n",
    "From the above value table, we can tell that it is good to be in the state $s_1$ than the state $s_0$  as $s_1$ has high value. Thus, we can say that the state $s_1$ is an optimal state than the state $s_0$ . \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Q function\n",
    "\n",
    "\n",
    "A Q function also called the state-action value function denotes the value of a state-action pair. The value of a state-action pair is the return agent would obtain starting from the state  and an action  following the policy . The value of a state-action pair or Q function is usually denoted by  and it is called Q value or state-action value. It is expressed as: \n",
    "\n",
    "\n",
    "![title](Images/32.PNG)\n",
    "\n",
    "Note that the only difference between the value function and Q function is that in the value function we compute the value of a state whereas in the Q function we compute the value of a state-action pair. Let's understand the Q function with an example. Consider the below trajectory generated using some policy :\n",
    "\n",
    "\n",
    "\n",
    "We learned that Q function computes the value of a state-action pair. Say we need to compute the Q value of state action pair, A-down. That is the Q value of performing action down in the state A. Then the Q value will be just the return of our trajectory starting from the state A and performing action down as shown below:\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Let's suppose we need to compute the Q value of the state action pair, D-right. That is the Q value of performing action right in the state D. Then Q value will be just the return of our trajectory starting from the state D and performing action right as shown below:\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Similarly, we can compute the Q value for all the state-action pairs. Similar to what we learned in the value function, instead of taking the return directly as the Q value of a state-action pair we will use the expected return because the return is the random variable and it takes different values with some probability. So we can redefine our Q function as:\n",
    "\n",
    "\n",
    "\n",
    "It implies that the Q value is the expected return agent would obtain starting from the state  and action  following the policy \n",
    "\n",
    "Similar to value function, the Q function depends on the policy, that is, the Q value varies based on the policy we choose. There can be many different Q functions according to different policies. Optimal Q function is the one which has the maximum Q value over other Q functions and it can be expressed as:\n",
    "\n",
    "\n",
    "\n",
    "The optimal policy  is the policy which gives the maximum Q value. \n",
    "\n",
    "Like value function, Q function can be viewed in a table. It is called a Q table. Let us say we have two states  and  and two actions 0 and 1, then the Q function can be represented as follows:\n",
    "\n",
    "\n",
    "![title](Images/33.PNG)\n",
    "\n",
    "As we can observe, the above Q table represents the Q-value of all possible state-action pairs. We can extract the policy from the Q function by just selecting the action which has the maximum Q value in each state:\n",
    "\n",
    "\n",
    "\n",
    "Thus, our policy will select action 1 in the state  and action 0 in the state  since they have a maximum Q value as shown below:\n",
    "\n",
    "![title](Images/34.PNG)\n",
    "\n",
    "\n",
    "Thus, we can extract the policy by computing the Q function. \n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
